Abstract

This project aims to investigate various factors associated with motor vehicle collisions and their relationships with the number and severity of collisions. The analysis was conducted using car crash data obtained from the New York City Police Department, which contains approximately 1.98 million collision records. The highest number of recorded collisions occurred between 2016-2018, with approximately 225,000 collisions occurring in each of these years. While the overall number of collisions seems to decrease from 2018-2022, the number of fatality or injury-inducing collisions increased by approximately 18% from 2018 to 2022.

Brooklyn recorded the highest number of collisions across all years, accounting for about 31-35% of collisions annually. Driver inattention or distraction was found to be the primary contributing factor for collisions, with approximately 31% of recorded reasons attributed to a driver not paying attention or being distracted and 416,287 collisions being caused by one or more drivers who were distracted. Alcohol abuse was identified as the main contributing factor for car crashes between the hours of 2-4 AM. Furthermore, accidents involving intoxicated drivers resulted in a significantly higher number of injuries or fatalities than those involving drivers who were not under the influence of alcohol or drugs.

The median age of drivers, irrespective of gender, was found to be 40, with approximately 50% of drivers falling between the ages of 29-53. The highest number of collisions were caused by drivers in the age range of 25-40. The data set recorded more than twice the number of collisions caused by men than by women. The median age of men involved in collisions was found to be 40, whereas the median age of women involved in collisions was 38. A significant relationship was observed between gender and age with respect to collisions, with the average age of males involved in collisions being higher than the average age of females involved in collisions.

Introduction

According to a report by the National Center for Health Statistics from the CDC, accidents are the fourth leading cause of death in New York State. Among those are included fatalities due to motor vehicle collision related injuries. As someone currently working for a personal injury attorney, I see many cases come in which involve collisions. In fact, most of the cases that come our way are motor vehicle accidents in which our clients sustained numerous injuries. In 2016, Vision Zero was created by the City of New York as a citywide traffic safety initiative aimed at reducing and eliminating serious crashes and crash related fatalities by collecting and gaining insights from traffic accident data. All police officers are required to fill out an MV-104 form at the scene of an accident, and these forms are now able to be filled out electronically and automatically stored in the Police Department’s crime data warehouse.

For this project, I am curious to see what are the main factors of motor vehicle collisions. I also want to understand which factors contribute to the likelihood and severity of motor vehicle collisions in New York City. Specifically, I will be focusing on factors such as age, gender, and driver intoxication, and the relationship that these factors have with the number and severity of collisions.

Loading and Cleaning the Data

The Data

Data is collected by the NYPD and is available online through the NYC Vision Zero Open Data page. For the purpose of this project, data was downloaded from the “Motor Vehicle Collisions – Crashes”, “Motor Vehicle Collisions – Vehicles”, and “Motor Vehicle Collisions – Person” links. Data was downloaded in CSV format and loaded into R.

City of New York. Retrieved April 29, 2023, from:

  1. https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95 (crashes)
  2. https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Vehicles/bm4k-52h4 (vehicles)
  3. https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Person/f55k-p6yu (person)

Each observation in the crashes data set corresponds with a crash that occurred, each with a unique collision_id. This can be matched to the collision_id in the vehicles and person data sets.

Loading Data from PostgreSQL Server

con <- dbConnect(
  Postgres(), 
  host = "localhost", 
  port = 5432,
  user = "postgres",
  password = Sys.getenv("SQL_DB_PASS"), 
  dbname = "cuny-sps",
)

crashes_data <- dbGetQuery(con, "SELECT * FROM project.crashes")
vehicles_data <- dbGetQuery(con, "SELECT * FROM project.vehicles")
person_data <- dbGetQuery(con, "SELECT * FROM project.person")

Data Cleaning

crashes <- crashes_data |>
  mutate(zip_code = as.numeric(zip_code)) |>
  select(-location, -on_street_name, -cross_street_name, -off_street_name) |>
  arrange(collision_id)

vehicles <- vehicles_data |> 
  select(-vehicle_id, -vehicle_make, -vehicle_model, -vehicle_year, -travel_direction, -driver_license_juristiction, -point_of_impact, -vehicle_damage, -vehicle_damage_1, -vehicle_damage_2, -vehicle_damage_3, -public_property_damage, -public_property_damage_type, -contributing_factor_1, -contributing_factor_2, "vehicle_id" = unique_id)

person <- person_data |>
  select(-person_id, -person_injury, -ejection, -emotional_status, -bodily_injury, -safety_equipment, -complaint, -ped_location, -ped_action, -ped_role, -contributing_factor_1, -contributing_factor_2, "person_id" = unique_id) |>
  mutate(vehicle_id = as.numeric(vehicle_id))

Creating a Drivers Data Set

The person data set contains all people involved in a crash. I want to determine what factors relating to the driver of the vehicle contribute to collisions. For this, I will create a subset of the data called drivers which only includes those indicated to be the drivers of vehicles involved in accidents.

The data set dictionary for the person data set indicates that age is automatically calculated by the system based on birth date. For some reason, this data set has values for ages way beyond and below what should be expected (from -999 to 9999). For this reason, I will limit my drivers data set to only include drivers that fall within a plausible age range. One of the limitations in filtering the data is that there may be some ages in this data set which are over or under represented due to incorrect input of the data.

drivers <- person |>
  filter(person_type == "Occupant", 
         position_in_vehicle == "Driver",
         person_age > 15, person_age < 87) |>
  select(-person_type, -position_in_vehicle)

Now I will join this to the vehicle and crashes data sets to create a data set of all crashes relative to the drivers of the vehicles.

crashes_by_driver <- vehicles |>
  select(-collision_id, -crash_date, -crash_time) |>
  right_join(drivers, by = "vehicle_id") |>
  filter(!is.na(vehicle_id)) |>
  select(-crash_date, -crash_time) |>
  left_join(crashes, by = "collision_id") |>
  filter(!is.na(collision_id)) |>
  select(collision_id, crash_date, crash_time, borough, person_age, person_sex, vehicle_type, driver_license_status, number_persons_injured, number_persons_killed) |>
  arrange(collision_id)

Data Set of Fatal/Injury Inducing Accidents

fatalities_and_injuries <- crashes |>
  filter(number_persons_injured > 0 | number_persons_killed > 0)

Preview the Data

Crashes

rmarkdown::paged_table(crashes)

Vehicles

rmarkdown::paged_table(vehicles)

Person

rmarkdown::paged_table(person)

Drivers

rmarkdown::paged_table(drivers)

Crashes by Driver

rmarkdown::paged_table(crashes_by_driver)

Injuries and Fatalities

rmarkdown::paged_table(fatalities_and_injuries)

Exploratory Data Analysis

Main Contributing Factors

To determine the contributing factors of collisions, I will use the contributing_factor_vehicle_x columns from the crashes data set. The data set will have to be transformed to a long format for this.

contributing_factors_by_crash <- crashes |>
  select(collision_id, crash_date, crash_time, borough,  c(contributing_factor_vehicle_1:contributing_factor_vehicle_5))

contributing_factors <- contributing_factors_by_crash |>
  pivot_longer(col = c(contributing_factor_vehicle_1:contributing_factor_vehicle_5),
               names_to = "vehicle",
               values_to = "factor") |>
  mutate(factor = replace(factor, str_detect(factor, "Cell Phone"), "Cell Phone"),
         factor = replace(factor, str_detect(factor, "Drugs"), "Drugs"),
         factor = replace(factor, str_detect(factor, "Ill"), "Illness"),
         factor = replace(factor, str_detect(factor, "Uninvolved Vehicle"), "Reaction to Uninvolved Vehicle")) |>
  filter(!is.na(factor), !factor %in% c("Unspecified", "1", "80"))

contributing_factors |>
  count(factor) |>
  arrange(desc(n)) |>
  mutate(prop = n / sum(n)) |>
  head(10) |>
  ggplot(aes(x = prop, y = reorder(factor, n))) +
  geom_bar(stat="identity", fill = "steelblue") +
  scale_x_continuous(label = scales::percent) +
  labs(title = "Top 10 Contributing Factors", x = "Percentage of Crashes", y = "Factor")

The main contributing factor in motor vehicle collisions is driver inattention/distraction, which accounts for about 31% of the causes for collisions within this data set. Other contributing factors include failure to yield to right-of-way, following too closely, and other driver action related factors, as well as driver fatigue.

contributing_factors |>
  select(-vehicle) |>
  unique() |> # remove replicated crashes (i.e. multiple drivers with same causal factor)
  count(factor) |>
  arrange(desc(n)) |>
  head(1)
## # A tibble: 1 × 2
##   factor                              n
##   <chr>                           <int>
## 1 Driver Inattention/Distraction 416287

416,287 collisions in this data set were caused by one or more drivers being inattentive or distracted.

What is the main contributing factor for crashes occurring at each hour of the day?

Since driver inattention seems to be the most overwhelming factor in this data set, I will filter that out as the top factor and will check to see the second most common factor for each hour.

contributing_factors |> 
  filter(!str_detect(factor, "Inattention/")) |>
  mutate(hour = hour(crash_time)) |>
  group_by(hour) |>
  count(factor) |>
  filter(n == max(n)) |>
  knitr::kable(col.names = c("Hour", "Factor", "Number of Collisions"))
Hour Factor Number of Collisions
0 Following Too Closely 4003
1 Other Vehicular 1782
2 Alcohol Involvement 1590
3 Alcohol Involvement 1609
4 Alcohol Involvement 1826
5 Following Too Closely 1425
6 Following Too Closely 3006
7 Following Too Closely 4448
8 Failure to Yield Right-of-Way 8204
9 Failure to Yield Right-of-Way 7141
10 Failure to Yield Right-of-Way 6426
11 Failure to Yield Right-of-Way 6571
12 Failure to Yield Right-of-Way 7383
13 Failure to Yield Right-of-Way 7991
14 Failure to Yield Right-of-Way 9244
15 Following Too Closely 9122
16 Failure to Yield Right-of-Way 10321
17 Failure to Yield Right-of-Way 10737
18 Failure to Yield Right-of-Way 9454
19 Failure to Yield Right-of-Way 7712
20 Failure to Yield Right-of-Way 6259
21 Failure to Yield Right-of-Way 5069
22 Failure to Yield Right-of-Way 4183
23 Failure to Yield Right-of-Way 3117

Between the hours of 2-4 AM, driver inebriation from alcohol involvement is the main factor contributing to collisions. For all other hours, following too closely or failure to yield to right of way is the main contributing factor.

How many collisions are caused by alcohol or drug involvement?

Overall

contributing_factors |>
  filter(str_detect(tolower(factor), "drugs|alcohol")) |>
  select(-vehicle) |>
  unique() |> # remove duplicate crashes
  nrow()
## [1] 23486

There are 23,486 incidents of alcohol or drug involvement in car accidents recorded in this data set.

How many of these accidents resulted in fatalities or serious injury?

intox_fatalities_injuries <- fatalities_and_injuries |>
  pivot_longer(col = c(contributing_factor_vehicle_1:contributing_factor_vehicle_5),
               names_to = "vehicle",
               values_to = "factor") |>
  filter(str_detect(tolower(factor), "drugs|alcohol")) |>
  select(-vehicle) |>
  unique() 

intox_fatalities_injuries |>
  nrow()
## [1] 7486
intox_fatalities_injuries |>
  mutate(injury = ifelse(number_persons_injured > 0, 1, 0),
         fatality = ifelse(number_persons_killed > 0, 1, 0)) |>
  summarize(total_injury_inducing = sum(injury),
            total_fatality_inducing = sum(fatality))
## # A tibble: 1 × 2
##   total_injury_inducing total_fatality_inducing
##                   <dbl>                   <dbl>
## 1                  7439                     117

There are 7,486 fatality/injury inducing accidents caused by alcohol/drug involvement of one or more drivers, 7,439 of which caused severe injuries and 117 of which resulting in death.

Per Year (Table)

contributing_factors |>
  filter(str_detect(tolower(factor), "drugs|alcohol")) |>
  count(year = year(crash_date)) |>
  knitr::kable(col.names = c("Year", "Number of Collisions Involving Intoxicated Drivers"))
Year Number of Collisions Involving Intoxicated Drivers
2012 856
2013 1760
2014 2261
2015 2201
2016 3033
2017 2848
2018 2577
2019 2389
2020 1741
2021 1924
2022 1953
2023 568

Per Year (Graph)

contributing_factors |>
  filter(str_detect(tolower(factor), "drugs|alcohol")) |>
  select(-vehicle) |>
  unique() |>
  count(year = as.factor(year(crash_date))) |> 
  ggplot(aes(x = year, y = n)) +
    geom_bar(stat="identity", fill = "steelblue4") +
    labs(title = "Number of Collisions Involving Intoxicated* Drivers", x = "Year", y = "Number of Accidents", caption = "*Drugs or Alcohol") +
    geom_text(aes(label = n), vjust = -0.5, size=3)

The most recorded collisions from alcohol/drug involvement was in 2016, with 3,033 accidents occurring in relation to intoxicated drivers. Since Vision Zero was only implemented in 2016, the number of collisions from 2012-2015 may not be complete so these numbers may not accurately reflect the true number of crashes caused by alcohol/drug involvement. Likewise, the data for 2023 is not complete, as the data was accessed about five months into the year and only accounts for collisions within those five months. Since the implementation of Vision Zero in 2016, it seems that collisions resulting from drug or alcohol intoxication has decreased till 2020 and then started to increase again slightly. The major dip from 2019 to 2020 may have been caused by stay-at-home policies due to Covid-19 resulting in less drivers overall on roads during this time.

Collisions by Year

As we are currently in the middle of 2023, the data collection for this year is incomplete and I will limit the data till 2022.

crashes_till_22 <- crashes_by_year <- crashes |>
  mutate(year = year(crash_date)) |>
  filter(year < 2023)

crashes_by_year <- crashes_till_22 |>
  count(year) 

crashes_by_year |>
  ggplot(aes(x = as.factor(year), y = n)) +
  geom_bar(stat="identity", fill = "plum4") +
  labs(title = "Collisions by Year (2012-2022)", x = "Year", y = "Number of Crashes") +
  scale_y_continuous(label = scales::comma)

There is an increase in the number of collisions until 2016 and then a decrease after 2018. The most collisions occurred in 2016-2018, with about 225,000 collisions occurring in each of these years. In 2020, the number of collisions decreased by almost 50% from the previous year, probably owing to stay-at-home orders due to Covid-19.

Fatalities and Injuries from Crashes

injuries_and_fatalities <- crashes_till_22 |>
  mutate(year = year(crash_date),
         injury = ifelse(number_persons_injured > 0, 1, 0),
         death = ifelse(number_persons_killed > 0, 1, 0)) |>
  group_by(year) |>
  summarize(year_involve_injury = sum(injury, na.rm=T),
            year_involve_death = sum(death, na.rm=T)) |>
  mutate(total_collisions = crashes_by_year$n,
         prop_injured = year_involve_injury / total_collisions,
         prop_killed = year_involve_death/ total_collisions) |>
  ungroup() |>
  pivot_longer(cols = c(prop_injured:prop_killed),
               names_to = "severity",
               values_to = "prop") |>
  mutate(severity = toupper(str_extract(severity, "killed|injured"))) 

injuries_and_fatalities |>
  ggplot(aes(x = as.factor(year), y = prop, fill = severity)) +
  geom_bar(stat="identity") +
  labs(title = "Collisions Involving Injury/Death per Year (2012-2022)", x = "Year", y = "Percentage of Crashes Involving Injury/Death") +
  scale_y_continuous(label = scales::percent) +
  geom_text(aes(label = paste0(round(prop*100, 2),'%')), vjust = -0.5, size=3)

This graph seems to follow the opposite trend of the number of collisions per year. While the number of collisions seems to have decreased from 2018-2022, the percentage of collisions involving fatalities and/or injuries has increased, with almost 40% of collisions in 2022 involving injuries and about 0.3% of crashes involving deaths.

Collisions by Borough

crashes_till_22 |>
  filter(!is.na(borough) & year >= 2017) |>
  group_by(year) |>
  count(borough) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = borough, y = prop)) +
  geom_bar(stat="identity", fill = "steelblue") +
  facet_wrap(~year) + 
  labs(title = "Collisions by Borough (2017-2022)", x = "Borough", y = "Percentage of Crashes", caption = "*Perentage out of collisions where borough is known (i.e. not NA)") +
  scale_y_continuous(label = scales::percent) +
  geom_text(aes(label = paste0(round(prop*100, 0),'%')), size = 3) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Brooklyn and Queens have consistently had the highest percentage of collisions annually, with Brooklyn making up between 31-35% of collisions and Queens making up about 27-29% of collisions per year from 2017-2022.

Collisions by Age

Vision Zero began in 2016, with better methods for collecting and storing data for collision reports. Therefore, I will limit the data to crashes from 2016-2022.

drivers_2016_2022 <- drivers |>
  mutate(year = year(crash_date)) |>
  filter(year >= 2016 & year <= 2022)

hist(drivers_2016_2022$person_age, 
     main="Distribution of Ages", 
     xlab = "Driver Age")

The distribution of driver’s ages in the data set is skewed to the right, with the most collisions being caused by drivers in the 25-40 range.

summary(drivers_2016_2022$person_age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   29.00   40.00   41.73   53.00   86.00

The median age of drivers in this data set is 40 years old, and the middle 50% of data is contained within the 29-53 range.

Collisions by Gender

Overall

drivers_2016_2022 |>
  filter(person_sex %in% c("M", "F")) |>
  count(person_sex) |>
  ggplot(aes(x = n, y = person_sex, fill = person_sex)) +
  geom_bar(stat="identity") +
  labs(title = "Collisions by Gender (2016-2022)", x = "Number of Collisions", y = "Gender") +
  theme(legend.position="none") +
  scale_x_continuous(labels = scales::comma)

Overall, there seems to be a greater number of collisions caused by males than by females.

Per Year

drivers_2016_2022 |>
  filter(person_sex %in% c("M", "F")) |>
  group_by(year) |>
  count(person_sex) |>
  ggplot(aes(x = n, y = person_sex, fill = person_sex)) +
  geom_bar(stat="identity") +
  facet_wrap(~year) +
  labs(title = "Collisions by Gender (2016-2022)", x = "Number of Collisions", y = "Gender") +
  theme(legend.position="none") +
  scale_x_continuous(labels = scales::comma)

Annually, the number of collisions caused by female drivers is less than half the amount of collisions caused by male drivers.

Collisions by Age and Gender

drivers_2016_2022 |>
  filter(person_sex %in% c("M", "F")) |>
  ggplot(aes(x = person_age, y = person_sex, fill = person_sex)) +
  geom_boxplot() +
  labs(title = "Distribution of Ages by Sex", x = "Age", y = "Sex") +
  theme(legend.position = "none")

Both plots have a skew to the right, with more collisions being caused by younger people of both genders. There does seem to be a relationship between age, gender, and collisions. There seems to be a higher distribution of ages for males involved in accidents than females involved in accidents. However, it is hard to tell from the box plots alone whether this relationship is significant.

males_2016_2022 <- drivers_2016_2022 |>
  filter(person_sex == "M")

females_2016_2022 <- drivers_2016_2022 |>
  filter(person_sex == "F")

summary(males_2016_2022$person_age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   30.00   40.00   42.09   53.00   86.00
summary(females_2016_2022$person_age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   29.00   38.00   40.71   51.00   86.00

The median age for men involved in collisions is 40, with 50% of men involved in collisions being between the ages of 30-53.

The median age for women involved in collisions is 38, with 50% of women involved in collisions being between the ages of 29-51.

Inference

Based on the exploratory data analysis, I would like to determine the extent to which alcohol/drug involvement plays a role in the severity of car crashes, and I would also like to examine if there is a relationship between age and gender and their combined contribution to the number of collisions.

Intoxication vs. Collision Severity

First, let’s look at the relationship between driver intoxication and collision severity.

Question: Is there a significant difference in the number of people injured or killed in motor vehicle collisions involving alcohol/drug-impaired drivers vs. those not involving alcohol-impaired drivers?

For this, I will be using a two-sample t-test to compare the mean number of injuries/deaths caused by car crashes involving driver intoxication vs. not involving driver intoxication.

\(H_0: \mu_1 = \mu_2\) (There is no significant difference in the number of people injured or killed in motor vehicle collisions involving alcohol/dug-impaired drivers and those not involving alcohol-impaired drivers)

\(H_A: \mu_1 \neq \mu_2\) (There is a significant difference in the number of people injured or killed in motor vehicle collisions involving alcohol/drug-impaired drivers and those not involving alcohol-impaired drivers)

Where \(\mu_1\) is the mean number of injuries/fatalities caused by intoxicated drivers and \(\mu_2\) is the mean number of injuries/fatalities caused by non-intoxicated drivers.

I will conduct this test twice for accidents resulting in injuries and accidents resulting in deaths.

Since Vision Zero only started in 2016 and the data for 2023 is not yet complete, I will filter the data to only collisions between 2016 and 2022. I am also going to include the use of prescription medication within the realm of drug and alcohol use.

collision_severity <- crashes |>
  filter(year(crash_date) >= 2016,
         year(crash_date) <= 2022) |>
  select(collision_id, number_persons_injured, number_persons_killed) |>
  mutate(injured_or_killed = number_persons_injured + number_persons_killed)

alcohol_drugs <- contributing_factors |>
  filter(year(crash_date) >= 2016,
         year(crash_date) <= 2022,
         str_detect(tolower(factor), "drugs|alcohol|medication")) |>
  pivot_wider(names_from = vehicle,
              values_from = factor) |>
  left_join(collision_severity, by = "collision_id")

not_alcohol_drugs <- contributing_factors |>
  filter(year(crash_date) >= 2016,
         year(crash_date) <= 2022) |>
  subset(!collision_id %in% alcohol_drugs$collision_id) |> # any collision not related to any intoxicated driver
  pivot_wider(names_from = vehicle,
              values_from = factor) |>
  left_join(collision_severity, by = "collision_id")

Let’s take a look at the means and variances for the number of injuries caused in these accidents:

# compare means
mean(alcohol_drugs$injured_or_killed)
## [1] 0.4964938
mean(not_alcohol_drugs$injured_or_killed, na.rm=T)
## [1] 0.3356462
# compare standard deviations
sd(alcohol_drugs$injured_or_killed)
## [1] 0.9526735
sd(not_alcohol_drugs$injured_or_killed, na.rm=T)
## [1] 0.7234148
# Welch t-test to determine if the difference in means is statistically significant
t.test(alcohol_drugs$number_persons_injured, not_alcohol_drugs$number_persons_injured, alternative = "two.sided", var.equal = F)
## 
##  Welch Two Sample t-test
## 
## data:  alcohol_drugs$number_persons_injured and not_alcohol_drugs$number_persons_injured
## t = 21.881, df = 17786, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1433694 0.1715823
## sample estimates:
## mean of x mean of y 
## 0.4917807 0.3343048

The variances for these samples are not the same so we use Welch’s t-test for the analysis. The p-value for this test is very small (< 2.2e-16), indicating that there is a significant difference in the mean number of people injured or killed in accidents involving alcohol/drugs (mean \(\approx\) 0.49) and those which do not involve alcohol/drugs (mean \(\approx\) 0.33). Based on the 95% confidence interval, we are 95% confident that the true difference in the means of people injured or killed in accidents involving inebriated drivers falls somewhere between about 0.143 and 0.172.

Age and Gender vs. Collisions

Now, let’s look at the relationship between age and gender and their joint contribution to the likelihood of collisions.

Question: Is there a significant difference in the ages of males and females involved in motor vehicle collisions?

\(H_0: \mu_1 = \mu_2\) (There is no significant difference in the ages of males and females involved in motor vehicle collisions)

\(H_A: \mu_1 \neq \mu_2\) (There is a significant difference in the ages of males and females involved in motor vehicle collisions)

To normalize the data, let’s look at drivers below the age of 67. To make the data more manageable, let’s look at just collisions for 2019.

drivers_16to67_2019 <- drivers_2016_2022 |>
  filter(person_sex %in% c("M", "F"), person_age <= 67, year == 2019) 

drivers_16to67_2019 |>
  ggplot(aes(x = person_age, y = person_sex, fill = person_sex)) +
  geom_boxplot() +
  labs(title = "Distribution of Ages by Sex in 2019 (Drivers Aged 16-67)", x = "Age", y = "Sex") +
  theme(legend.position = "none")

The distribution of ages now looks to be approximately normal. We can see that the median age for male drivers involved in collisions is higher than that of female drivers involved in collisions. Let’s compare the mean ages for both genders.

drivers_16to67_2019 |>
  group_by(person_sex) |>
  summarize(mean_age = mean(person_age))
## # A tibble: 2 × 2
##   person_sex mean_age
##   <chr>         <dbl>
## 1 F              39.3
## 2 M              40.7

There is an observed difference. But, is this difference statistically significant?

males <- drivers_16to67_2019 |>
  filter(person_sex == "M")

females <- drivers_16to67_2019 |>
  filter(person_sex == "F")

sd(males$person_age)
## [1] 13.12219
sd(females$person_age)
## [1] 12.63525
t.test(males$person_age, females$person_age, alternative = "two.sided", var.equal = T)
## 
##  Two Sample t-test
## 
## data:  males$person_age and females$person_age
## t = 26.339, df = 299898, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.315233 1.526709
## sample estimates:
## mean of x mean of y 
##  40.72104  39.30007

The p-value for this t-test is very small (< 2.2e-16), so therefore we reject the null hypothesis and say that there is a significant difference in the mean ages between males and females involved in accidents. The mean age for men is about 40.72 and the mean age for women is about 39.30. Based on the 95% confidence interval, we are 95% confident that the true difference in the mean ages for men and women involved in car accidents is between 1.315 and 1.527.

We can also visualize this significance using the null distribution.

obs_diff <- drivers_16to67_2019 |>
  specify(person_age~person_sex) |>
  calculate(stat = "diff in means", order = c("M", "F"))

null_dist <- drivers_16to67_2019 |>
  specify(person_age~person_sex) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in means", order = c("M", "F"))

visualize(null_dist) + shade_p_value(obs_stat = obs_diff, direction = "two_sided")

The observed difference is way off from what we would expect to see if there was not a statistically significant relationship. Therefore, we can conclude that there is a significant difference in the mean ages of men and women who get involved in collisions.

Conclusions

Based on the analysis of car crash data obtained from the New York City Police Department, this project identified driver inattention/distraction as the primary contributing factor for collisions, with 416,287 collisions recorded as being caused by one or more drivers who were distracted or not paying attention. Aside from this, alcohol abuse was found to be the main contributing factor for car crashes between the hours of 2-4 AM. Using a two-sample Welch t-test, the data provided evidence that the number of people injured or killed in collisions involving intoxicated drivers is significantly more than the number of people injured or killed in car crashes where there is no alcohol or drug involvement.

The analysis also revealed a significant relationship between gender and age with respect to collisions. The data also showed a higher number of collisions caused by men than by women, with the median age of men involved in collisions (40) being higher than the median age of women involved in collisions (38). Using a two-sample t-test and visualizing with the null distribution, this difference was shown to be significant, with the average age of men involved in collisions being significantly higher than the average age of women involved in accidents.

The findings of this study have important implications for policymakers and stakeholders in the transportation and law enforcement sectors, who can leverage this information to design evidence-based interventions and strategies to reduce the number of motor vehicle collisions, injuries, and fatalities in New York City and beyond. The insights generated from this study can also inform future research efforts in this field, as well as support the development of public education campaigns that promote safe driving practices and increase awareness of the risks associated with distracted and impaired driving.

Limitations

There were a few limitations I noticed while analyzing these data sets. As noted above, even though age is described as being calculated based on birth date, there are a number of ages which fall outside the range of plausibility. When creating a data set of drivers involved in these collisions, I chose to filter the data set only for people within the ages of 16 and 87 i order to improve accuracy and to be able to analyze the relationship that age may have with the number and severity of crashes. However, there may be ages that are over or under represented in this new data set due to this error and subsequent filtering.

Another limitation that I found was that the crashes data set has separate columns for the contributing factors based on each vehicle involved in a collision (contributing_factor_vehicle_x), but, there is no indication in the person data set as to which vehicle they were in (i.e. were they vehicle 1 in the MV-104 form or vehicle 2, etc). Additionally, the contributing_factor_x columns in the person data set was not as robust as the contributing factors columns in the crashes data set. For this reason, I could not compare the main contributing factors for car accidents across different genders or age groups, as I could not match up the people to the vehicles involved in the crashes.

An additional limitation is that this data is limited only to reported accidents. Accidents which were not reported to the authorities are not included in this data set and therefore the data may not be representative of all collisions in New York City as certain types of collisions or areas with low reporting rates may not be represented.